An On-Chip Multiprocessor Architecture with a Non-Blocking Synchronization Mechanism
Abstract
[...]tive to superscalar architectures [5][8][12][13]. The strengths of an on-chip MP architecture are threefold. First, an MP can exploit a different level of parallelism, thread-level parallelism (TLP), in addition to ILP. Second, complexity can be kept low by using simple processors, which allows a high clock rate. Third, communication latency can be significantly reduced by using an on-chip network. Together, these strengths make it possible for MPs to outperform superscalar processors.

On the other hand, the weakness is that a traditional MP is not efficient enough to exploit TLP in integer programs, the most important applications for computers. In general, little coarse-grained TLP is available in an integer program because of its complex and frequent control and data dependences; only fine-grained TLP is available. Improving performance by exploiting fine-grained TLP requires efficiency. One critical efficiency problem is reducing the overhead of communication and synchronization among processors. An on-chip network obviously helps, but it is not enough, because communication and synchronization through memory locations still incur a large overhead [5].

Register-based communication and synchronization mechanisms [2][5][12] have been proposed to further reduce the overhead, in place of traditional memory-based mechanisms. The register-based mechanisms communicate register values directly among processors, eliminating memory accesses. They also support fast synchronization on the register file. Although register-based mechanisms significantly reduce the overhead, they still have a problem: in exchange for reducing synchronization overhead, they interfere with exploiting ILP within a single thread.
Similar resources
The Elephant and the Mouse: Non-Strict Fine-Grain Synchronization for Many-Core Architectures
A new synchronization mechanism created under the dataflow model of computation was introduced during the late 1970s and called I-Structure. I-Structure exhibited the following important features: (1) it is a dataflow style synchronization, i.e., synchronization only occurs between an I-Structure producer and consumer operations that are accessing the same memory location; (2) it is fine-grain ...
Non-Blocking Routers Design Based on West First Routing Algorithm & MZI Switches for Photonic NoC
For the first time, 4- and 5-port optical routers are designed using the West First (WF) routing algorithm for use in optical networks on chip. The WF algorithm enables the designed routers to provide non-blocking routing in a photonic network on chip. These routers are not only based on high-speed Mach-Zehnder switches (which have a higher bandwidth and more thermal tolerance than mi...
Efficient Fine Grained Synchronization Support Using Full/Empty Tagged Shared Memory and Cache Coherency
Performance results of machines with fine-grain synchronization on individual lock-free data items (e.g., words), such as the MIT Alewife multiprocessor, illustrate the benefits of supporting fine-grain synchronization. The performance benefits are primarily the result of allowing a dataflow style of computation in programming models, and maximizing the exposed parallelism by minimizing the pos...
Performance Evaluation of Macroblock-level Parallelization of H.264 Decoding on a cc-NUMA Multiprocessor Architecture
This paper presents a study of the performance scalability of a macroblock-level parallelization of the H.264 decoder for High Definition (HD) applications on a multiprocessor architecture. We have implemented this parallelization on a cache coherent Non-uniform Memory Access (cc-NUMA) shared memory multiprocessor (SMP) and compared the results with the theoretical expectations. The study inclu...